[BEAM-12448] Relax ping limits on the gRPC server by stefanistrate · Pull Request #15314 · apache/beam

stefanistrate · 2021-08-11T08:02:25Z

I find myself running longer jobs locally via the DirectRunner and I keep getting errors like this:

E0810 10:28:22.716036000 123145578897408 chttp2_transport.cc:1081]     Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
    target=lambda: self._read_inputs(elements_iterator),
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 570, in _read_inputs
    for elements in elements_iterator:
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 382, in __next__
    return self._next()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 374, in _next
    request = self._look_for_request()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 358, in _look_for_request
    _raise_rpc_error(self._state)
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_server.py", line 113, in _raise_rpc_error
    raise rpc_error
grpc.RpcError
Exception in thread read_grpc_client_inputs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
I0810 10:28:22.788511 123145827430400 local_job_service.py:340] Worker: severity: ERROR timestamp {   seconds: 1628587702   nanos: 724023103 } message: "Failed to read inputs in the data plane.\nTraceback (most recent call last):\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py\", line 57
0, in _read_inputs\n    for elements in elements_iterator:\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 803, in _next\n    raise self\ngrpc._channel._MultiThread
edRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_string = \"{\"created\":\"@1628587702.717555000\",\"description\":\"Error received from peer ipv6:[::1]:57093\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1063,\"grpc_message\":\"Too many pi
ngs\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py\", line 570, in _read_inputs\n    for elements in elements_iterator:\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", lin
e 416, in __next__\n    return self._next()\n  File \"/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py\", line 803, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"Too many pings\"\n\tdebug_error_stri
ng = \"{\"created\":\"@1628587702.717555000\",\"description\":\"Error received from peer ipv6:[::1]:57093\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1063,\"grpc_message\":\"Too many pings\",\"grpc_status\":14}\"\n>\n" log_location: "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py:57
7" thread: "read_grpc_client_inputs"
    self.run()
  File "/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 587, in <lambda>
    target=lambda: self._read_inputs(elements_iterator),
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/apache_beam/runners/worker/data_plane.py", line 570, in _read_inputs
    for elements in elements_iterator:
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/Users/stefan/workspace/earth/.env/lib/python3.9/site-packages/grpc/_channel.py", line 803, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Too many pings"
        debug_error_string = "{"created":"@1628587702.717555000","description":"Error received from peer ipv6:[::1]:57093","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Too many pings","grpc_status":14}"

I understand that the DirectRunner has its limitations and, in my case, I'm not using it for production jobs. However, it would be nice if some of the ping limits are relaxed a bit. Currently, the gRPC server has fairly strict limits:

GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA = 2
GRPC_ARG_HTTP2_MAX_PING_STRIKES = 2

My PR removes these 2 limits completely wherever grpc.server() is invoked (Python SDK only). I'm not sure this is the best approach, but I'm happy to adjust it based on your feedback.

Bonus request: it would be even nicer if Beam could allow passing options to the gRPC client & server via some flags.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

`ValidatesRunner` compliance status (on master branch)

Lang	ULR	Twister2
Go	---	---
Java
Python	---	---
XLang		---

Examples testing status on various runners

Lang	ULR	Dataflow	Flink	Samza	Spark	Twister2
Go	---	---	---	---	---	---	---
Java	---		---	---	---	---	---
Python	---	---	---	---	---	---	---
XLang	---	---	---	---	---	---	---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go	Java	Python

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website	Whitespace	Typescript
Non-portable
Portable	---			---	---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

codecov · 2021-08-11T08:18:33Z

Codecov Report

Merging #15314 (1d0a16f) into master (2fd9875) will decrease coverage by 0.00%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master   #15314      +/-   ##
==========================================
- Coverage   83.81%   83.80%   -0.01%     
==========================================
  Files         441      444       +3     
  Lines       59745    60477     +732     
==========================================
+ Hits        50075    50683     +608     
- Misses       9670     9794     +124

Impacted Files	Coverage Δ
...ache_beam/runners/portability/local_job_service.py	`80.00% <50.00%> (-0.36%)`	⬇️
...e_beam/runners/portability/abstract_job_service.py	`83.70% <100.00%> (+0.09%)`	⬆️
...nners/portability/fn_api_runner/worker_handlers.py	`79.44% <100.00%> (ø)`
...thon/apache_beam/runners/worker/channel_factory.py	`75.00% <100.00%> (ø)`
...hon/apache_beam/runners/worker/worker_pool_main.py	`59.25% <100.00%> (+0.50%)`	⬆️
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py	`56.81% <0.00%> (-12.95%)`	⬇️
...ks/python/apache_beam/runners/worker/data_plane.py	`87.50% <0.00%> (-3.11%)`	⬇️
...ache_beam/runners/interactive/recording_manager.py	`96.55% <0.00%> (-2.43%)`	⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py	`94.02% <0.00%> (-2.24%)`	⬇️
sdks/python/apache_beam/internal/metrics/metric.py	`90.00% <0.00%> (-1.49%)`	⬇️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fd9875...1d0a16f. Read the comment docs.

tvalentyn · 2021-08-31T21:15:47Z

hi @angoenka - could you please take a look at this change?
@stefanistrate - you'd need to fix autoformatter warnings on this change eventually, see: https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks

aaltay · 2021-08-31T22:59:22Z

I missed this review request earlier. Sorry for the delay and thank you for the contribution.

ibzib · 2021-09-10T23:30:46Z

R: @ibzib

stefanistrate · 2021-10-05T15:18:02Z

I've just run yapf on my code to fix formatting issues. Those tests are passing now.

tvalentyn · 2021-10-05T18:05:40Z

07:38:58 The command '/bin/sh -c mkdir -p /usr/local/gcloud &&     cd /usr/local/gcloud &&     curl -s -O https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz &&     tar -xf google-cloud-sdk.tar.gz &&     /usr/local/gcloud/google-cloud-sdk/install.sh &&     rm google-cloud-sdk.tar.gz' returned a non-zero code: 1
07:38:58 
07:38:58 > Task :sdks:python:container:py36:docker FAILED
07:39:00

perhaps it's a flake, let's retry.

tvalentyn · 2021-10-05T18:05:48Z

Run Portable_Python PreCommit

ibzib

LGTM

Fix "too many pings" errors.

addc398

Increase keepalive timeout to 5 minutes.

5d50ab9

pabloem requested a review from tvalentyn August 19, 2021 17:44

aaltay requested a review from angoenka August 31, 2021 22:58

ibzib changed the title ~~Relax ping limits on the gRPC server~~ [BEAM-12448] Relax ping limits on the gRPC server Sep 10, 2021

aaltay requested a review from ibzib September 23, 2021 19:04

Fix yapf complaints.

1d0a16f

ibzib approved these changes Oct 6, 2021

View reviewed changes

ibzib merged commit fa7065e into apache:master Oct 6, 2021

hanneshapke mentioned this pull request Oct 28, 2021

Beam worker fails with "ENHANCE_YOUR_CALM" in TFX 1.0.0rc1 tensorflow/tfx#3961

Closed

shunping mentioned this pull request Dec 5, 2025

Fix too-many-pings on FnAPI runner under grpc mode #37013

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BEAM-12448] Relax ping limits on the gRPC server#15314

[BEAM-12448] Relax ping limits on the gRPC server#15314
ibzib merged 3 commits intoapache:masterfrom
stefanistrate:stefan-fix-too-many-pings

stefanistrate commented Aug 11, 2021

Uh oh!

codecov bot commented Aug 11, 2021 •

edited

Loading

Uh oh!

tvalentyn commented Aug 31, 2021

Uh oh!

aaltay commented Aug 31, 2021

Uh oh!

ibzib commented Sep 10, 2021

Uh oh!

stefanistrate commented Oct 5, 2021

Uh oh!

tvalentyn commented Oct 5, 2021

Uh oh!

tvalentyn commented Oct 5, 2021

Uh oh!

ibzib left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

stefanistrate commented Aug 11, 2021

ValidatesRunner compliance status (on master branch)

Examples testing status on various runners

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

GitHub Actions Tests Status (on master branch)

Uh oh!

codecov bot commented Aug 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

tvalentyn commented Aug 31, 2021

Uh oh!

aaltay commented Aug 31, 2021

Uh oh!

ibzib commented Sep 10, 2021

Uh oh!

stefanistrate commented Oct 5, 2021

Uh oh!

tvalentyn commented Oct 5, 2021

Uh oh!

tvalentyn commented Oct 5, 2021

Uh oh!

ibzib left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

`ValidatesRunner` compliance status (on master branch)

codecov bot commented Aug 11, 2021 •

edited

Loading